Controlling a Target System

ABSTRACT

For controlling a target system, such as a gas or wind turbine or another technical system, a pool of control policies is used. The pool of control policies, which includes a plurality of control policies, and weights for weighting each control policy of the plurality of control policies are received. The plurality of control policies is weighted by the weights to provide a weighted aggregated control policy. The target system is controlled using the weighted aggregated control policy, and performance data relating to a performance of the controlled target system is received. The weights are adjusted based on the received performance data to improve the performance of the controlled target system. The plurality of control policies is reweighted by the adjusted weights to adjust the weighted aggregated control policy.

BACKGROUND

The control of complex dynamical technical systems (e.g., gas turbines, wind turbines, or other plants) may be optimized by data driven approaches. With that, various aspects of such dynamical systems may be improved. For example, efficiency, combustion dynamics, or emissions for gas turbines may be improved. Additionally, life-time consumption, efficiency, or yaw for wind turbines may be improved.

Modern data driven optimization utilizes machine learning methods for improving control policies (e.g., control strategies) of dynamical systems with regard to general or specific optimization goals. Such machine learning methods may outperform conventional control strategies. For example, if the controlled system is changing, an adaptive control approach capable of learning and adjusting a control strategy according to the new situation and new properties of the dynamical system may be advantageous over conventional non-learning control strategies.

However, in order to optimize complex dynamical systems (e.g., gas turbines or other plants), a sufficient amount of operational data is to be collected in order to find or learn a good control strategy. Thus, in case of commissioning a new plant or upgrading or modifying the plant, it may take some time to collect sufficient operational data of the new or changed system before a good control strategy is available. Reasons for such changes may be wear, changed parts after a repair, or different environmental conditions.

Known methods for machine learning include reinforcement learning methods that focus on data efficient learning for a specified dynamical system. However, even when using these methods, it may take some time until a good data driven control strategy is available after a change of the dynamical system. Until then, the changed dynamical system operates outside a possibly optimized envelope. If the change rate of the dynamical system is very high, only sub-optimal results for a data driven optimization may be achieved, since a sufficient amount of operational data may never be available.

SUMMARY AND DESCRIPTION

The scope of the present invention is defined solely by the appended claims and is not affected to any degree by the statements within this summary.

The present embodiments may obviate one or more of the drawbacks or limitations in the related art. For example, control of a target system that allows a more rapid learning of a control policy (e.g., for a changing target system) is provided.

Embodiments of a method, a controller, and a computer program product for controlling a target system (e.g., a gas or wind turbine or another technical system) by a processor are based on a pool of control policies. The method, controller, or computer program product (non-transitory computer readable storage medium having instructions, which when executed by a processor, perform actions) is configured to receive the pool of control policies, which includes a plurality of control policies, and to receive weights for weighting each of the plurality of control policies. The plurality of control policies is weighted by the weights to provide a weighted aggregated control policy. The target system is controlled using the weighted aggregated control policy, and performance data relating to a performance of the controlled target system are received. The weights are adjusted by the processor based on the received performance data to improve the performance of the controlled target system. The plurality of control policies is reweighted by the adjusted weights to adjust the weighted aggregated control policy.
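
For illustration only, the following minimal Python sketch outlines this sequence of acts as a single control loop. The system interface with observe() and apply() and the adjust_weights callback are hypothetical stand-ins chosen for the example, not part of the claimed embodiments.

    import numpy as np

    def run_weighted_ensemble_control(policies, weights, system,
                                      adjust_weights, n_steps=100):
        """Sketch of the claimed acts: weight a pool of policies, control the
        target system with the aggregated policy, receive performance data,
        adjust the weights, and reweight the policies."""
        weights = np.asarray(weights, dtype=float)
        for _ in range(n_steps):
            state = system.observe()                         # performance/state data
            proposals = np.array([p(state) for p in policies])
            action = np.average(proposals, weights=weights)  # weighted aggregated action
            reward = system.apply(action)                    # performance feedback
            weights = adjust_weights(weights, state, reward) # e.g., by reinforcement learning
        return weights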

One or more of the present embodiments allow for an effective learning of peculiarities of the target system by adjusting the weights for the plurality of control policies. Such weights may include far fewer parameters than the pool of control policies. Thus, the adjusting of the weights may use much less computing effort and may converge much faster than a training of the whole pool of control policies. A high level of optimization may thus be reached in a shorter time. For example, a reaction time to changes of the target system may be significantly reduced. Aggregating a plurality of control policies reduces a risk of accidentally choosing a poor policy, thus increasing the robustness of the method.

According to an embodiment, the weights may be adjusted by training a neural network run by the processor.

The usage of a neural network for the adjusting of the weights allows for an efficient learning and flexible adaptation.

According to a further embodiment, the plurality of control policies may be calculated from different data sets of operational data of one or more source systems (e.g., by training a neural network). The different data sets may relate to different source systems, to different versions of one or more source systems, to different policy models, to source systems in different climes, or to one or more source systems under different conditions (e.g., before and after repair, maintenance, changed parts, etc.).

The one or more source systems may be chosen similar to the target system, so that control policies optimized for the one or more source systems are expected to perform well for the target system. Therefore, the plurality of control policies based on one or more similar source systems are a good starting point for controlling the target system. Such a learning from similar situations is often denoted as “transfer learning.” Hence, much less performance data relating to the target system are used in order to obtain a good aggregated control policy for the target system. Thus, effective aggregated control policies may be learned in a short time even for target systems with scarce data.

The calculation of the plurality of control policies may use a reward function relating to a performance of the source systems. That reward function may also be used for adjusting the weights.

The performance data may include state data relating to a current state of the target system. The plurality of control policies may be weighted and/or reweighted in dependence of the state data. This allows for a more accurate and more effective adjustment of the weights. For example, the weight of a control policy may be increased if a state is recognized where the control policy turned out to perform well, and vice versa.

Advantageously, the performance data may be received from the controlled target system, from a simulation model of the target system, and/or from a policy evaluation. Performance data from the controlled target system allows monitoring the actual performance of the target system and may improve the performance by learning a particular response characteristic of the target system. A simulation model of the target system also allows what-if queries for the reward function. With a policy evaluation, a Q-function may be set up, allowing an expectation value to be determined for the reward function.

An aggregated control action for controlling the target system may be determined according to the weighted aggregated control policy by weighted majority voting, by forming a weighted mean, and/or by forming a weighted median from action proposals according to the plurality of control policies.
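
As a non-authoritative illustration, the sketch below implements these three aggregation variants over a set of action proposals; the function name and the mode parameter are assumptions made for the example.

    import numpy as np

    def aggregate_action(proposals, weights, mode="mean"):
        """Combine per-policy action proposals into one aggregated control action.

        proposals: array of shape (N,), one action proposal per policy.
        weights:   array of shape (N,), one non-negative weight per policy.
        """
        w = np.asarray(weights, dtype=float)
        a = np.asarray(proposals)
        if mode == "vote":                 # weighted majority voting (discrete actions)
            candidates = np.unique(a)
            scores = [w[a == c].sum() for c in candidates]
            return candidates[int(np.argmax(scores))]
        if mode == "mean":                 # weighted mean (continuous actions)
            return float(np.average(a, weights=w))
        if mode == "median":               # weighted median (continuous actions)
            order = np.argsort(a)
            cum = np.cumsum(w[order])
            return float(a[order][np.searchsorted(cum, 0.5 * cum[-1])])
        raise ValueError(f"unknown mode: {mode}")

For example, with proposals [1.0, 2.0, 4.0] and weights [0.5, 0.3, 0.2], the weighted mean is 1.9 and the weighted median is 1.0.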

According to one embodiment, the training of the neural network may be based on a reinforcement learning model, which allows an efficient learning of control policies for dynamical systems.

For example, the neural network may operate as a recurrent neural network. This allows for maintaining an internal state, enabling an efficient detection of time dependent patterns when controlling a dynamical system. Many Partially Observable Markov Decision Processes may be handled like Markov Decision Processes by a recurrent neural network.

The plurality of control policies may be selected from the pool of control policies in dependence of a performance evaluation of control policies. The selected control policies may establish an ensemble of control policies. For example, only those control policies may be selected from the pool of control policies that perform well according to a predefined criterion.

Control policies from the pool of control policies may be included into the plurality of control policies or excluded from the plurality of control policies in dependence of the adjusted weights. This allows improvement of the selection of control policies contained in the plurality of control policies. So, for example, control policies with very small weights may be removed from the plurality of control policies in order to reduce a computational effort.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an exemplary embodiment including a target system and a plurality of source systems together with controllers generating a pool of control policies; and

FIG. 2 illustrates the target system together with a controller in greater detail.

DETAILED DESCRIPTION

FIG. 1 illustrates an exemplary embodiment including a target system TS and a plurality of source systems S1, . . . , SN. The target system TS and the plurality of source systems S1, . . . , SN may be gas or wind turbines or other dynamical systems including simulation tools for simulating a dynamical system. In one embodiment, the source systems S1, . . . , SN are chosen to be similar to the target system TS.

The source systems S1, . . . , SN may also include the target system TS at a different time (e.g., before maintenance of the target system TS or before exchange of a system component, etc.). Vice versa, the target system TS may be one of the source systems S1, . . . , SN at a later time.

Each of the source systems S1, . . . , SN is controlled by a reinforcement learning controller RLC1, . . . , or RLCN, respectively. The reinforcement learning controllers RLC1, . . . , RLCN are driven by control policies P1, . . . , PN, respectively. The reinforcement learning controllers RLC1, . . . , RLCN may each include a recurrent neural network (not shown) for learning (e.g., optimizing) the control policies P1, . . . , PN. Source system specific operational data OD1, . . . , ODN of the source systems S1, . . . , SN are collected and stored in databases DB1, . . . , DBN. The operational data OD1, . . . , ODN are processed according to the control policies P1, . . . , PN, and the control policies P1, . . . , PN are refined by reinforcement learning by the reinforcement learning controllers RLC1, . . . , RLCN. The control output of the control policies P1, . . . , PN is fed back into the respective source system S1, . . . , or SN via a control loop CL, resulting in a closed learning loop for the respective control policy P1, . . . , or PN in the respective reinforcement learning controller RLC1, . . . , or RLCN. The control policies P1, . . . , PN are fed into a reinforcement learning policy generator PGEN that generates a pool P of control policies including the control policies P1, . . . , PN.

The target system TS is controlled by a reinforcement learning controller RLC including a recurrent neural network RNN and an aggregated control policy ACP. The reinforcement learning controller RLC receives the control policies P1, . . . , PN from the reinforcement learning policy generator PGEN and generates the aggregated control policy ACP from the control policies P1, . . . , PN.

The reinforcement learning controller RLC receives performance data PD relating to a current performance of the target system TS (e.g., a current power output, a current efficiency, etc.) from the target system TS. The performance data PD includes state data SD relating to a current state of the target system TS (e.g., temperature, rotation speed, etc.). The performance data PD is input to the recurrent neural network RNN for training of the recurrent neural network RNN and input to the aggregated control policy ACP for generating an aggregated control action for controlling the target system TS via a control loop CL. This results in a closed learning loop for the reinforcement learning controller RLC.

The usage of pre-trained control policies P1, . . . , PN from several similar source systems S1, . . . , SN gives a good starting point for a neural model run by the reinforcement learning controller RLC. With that, the amount of data and/or time required for learning an efficient control policy for the target system TS may be reduced considerably.

FIG. 2 illustrates one embodiment of the target system TS together with the reinforcement learning controller RLC in greater detail. The reinforcement learning controller RLC includes a processor PROC and, as already mentioned above, the recurrent neural network RNN and the aggregated control policy ACP. The recurrent neural network RNN implements a reinforcement learning model.

The performance data PD(SD) including the state data SD stemming from the target system TS is input to the recurrent neural network RNN and to the aggregated control policy ACP. The control policies P1, . . . , PN are input to the reinforcement learning controller RLC. The control policies P1, . . . , PN may include the whole pool P or a selection of control policies from the pool P.

The recurrent neural network RNN is adapted to train a weighting policy WP including weights W1, . . . , WN for weighting each of the control policies P1, . . . , PN. The weights W1, . . . , WN are initialized by initial weights IW1, . . . , IWN received by the reinforcement learning controller RLC (e.g., from the reinforcement learning policy generator PGEN or from a different source).
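
A minimal sketch of such a weighting policy is given below, assuming a plain feedforward layer with a softmax output as a stand-in for the recurrent neural network RNN; the class name and the linear parameterization are assumptions made for illustration.

    import numpy as np

    def softmax(z):
        """Map real-valued scores to positive weights that sum to one."""
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    class WeightingPolicy:
        """Toy stand-in for the weighting policy WP: maps the current state
        to one weight per control policy. A recurrent network would
        additionally carry a hidden state across time steps."""

        def __init__(self, state_dim, n_policies, seed=0):
            rng = np.random.default_rng(seed)
            self.W = rng.normal(scale=0.1, size=(n_policies, state_dim))
            self.b = np.zeros(n_policies)  # could be set from initial weights IW1, . . . , IWN

        def __call__(self, state):
            return softmax(self.W @ np.asarray(state, dtype=float) + self.b)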

The aggregated control policy ACP relies on an aggregation function AF receiving the weights W1, . . . , WN from the recurrent neural network RNN and on the control policies P1, . . . , PN. Each of the control policies P1, . . . , PN or a pre-selected part of the control policies P1, . . . , PN receives the performance data PD(SD) with the state data SD and calculates from the performance data PD(SD) and the state data SD a specific action proposal AP1, . . . , or APN, respectively. The action proposals AP1, . . . , APN are input to the aggregation function AF, which weights each of the action proposals AP1, . . . , APN with a respective weight W1, . . . , or WN to generate an aggregated control action AGGA. The action proposals AP1, . . . , APN may be weighted (e.g., by majority voting, by forming a weighted mean, and/or by forming a weighted median from the control policies P1, . . . , PN). The target system TS is controlled by the aggregated control action AGGA.

The performance data PD(SD) resulting from the control of the target system TS by the aggregated control action AGGA are fed back to the aggregated control policy ACP and to the recurrent neural network RNN. From the fed back performance data PD(SD), new specific action proposals AP1, . . . , APN are calculated by the control policies P1, . . . , PN. The recurrent neural network RNN uses a reward function (not shown) relating to a desired performance of the target system TS for adjusting the weights W1, . . . , WN in dependence of the performance data PD(SD) fed back from the target system TS. The weights W1, . . . , WN are adjusted by reinforcement learning with an optimization goal directed to an improvement of the desired performance. With the adjusted weights W1, . . . , WN, an update UPD of the aggregation function AF is made. The updated aggregation function AF weights the new action proposals AP1, . . . , APN (e.g., reweights the control policies P1, . . . , PN) by the adjusted weights W1, . . . , WN in order to generate a new aggregated control action AGGA for controlling the target system TS. The above acts implement a closed learning loop leading to a considerable improvement of the performance of the target system TS.

A more detailed description of the embodiment is given below.

Each control policy P1, . . . , PN is initially calculated by the reinforcement learning controllers RLC1, . . . , RLCN based on a set of operational data OD1, . . . , or ODN, respectively. The set of operational data for a specific control policy may be specified in multiple ways. Examples of such specific sets of operational data may be operational data of a single system (e.g., a single plant), operational data of multiple plants of a certain version, operational data of plants before and/or after a repair, or operational data of plants in a certain clime, in a certain operational condition, and/or in a certain environmental condition. Different control policies from P1, . . . , PN may refer to different policy models trained on a same set of operational data.

When applying any of such control policies specific to a certain source system to a target system, the target system may not perform optimally, since none of the data sets was representative of the target system. Therefore, a number of control policies may be selected from the pool P to form an ensemble of control policies P1, . . . , PN. Each control policy P1, . . . , PN provides a separate action proposal AP1, . . . , or APN, from the performance data PD(SD). The action proposals AP1, . . . , APN are aggregated to calculate the aggregated control action AGGA of the aggregated control policy ACP. In case of discrete action proposals AP1, . . . , APN, the aggregation may be performed using majority voting. If the action proposals AP1, . . . , APN are continuous, a mean or median value of the action proposals AP1, . . . , APN may be used for the aggregation.

The reweighting of the control policies P1, . . . , PN by the adjusted weights W1, . . . , WN allows for a rapid adjustment of the aggregated control policy ACP, for example, if the target system TS changes. The reweighting depends on the recent performance data PD(SD) generated while interacting with the target system TS. Since the weighting policy WP has fewer free parameters (e.g., the weights W1, . . . , WN) than a control policy usually has, less data is used to adjust to a new situation or to a modified system. The weights W1, . . . , WN may be adjusted using the current performance data PD(SD) of the target system and/or using a model of the target system (e.g., implemented by an additional recurrent neural network) and/or using a policy evaluation.

According to a simple implementation, each control policy P1, . . . , PN may be globally weighted (e.g., over a complete state space of the target system TS). A weight of zero may indicate that a particular control policy is not part of the ensemble of policies.

Additionally or alternatively, the weighting by the aggregation function AF may depend on the system state (e.g., on the state data SD of the target system TS). This may be used to favor good control policies with high weights within one region of the state space of the target system TS. Within other regions of the state space, the control policies may not be used at all.

P_(i), i=1, . . . , N may denote a control policy from the set of stored control policies P1, . . . , PN, and s may be a vector denoting a current state of the target system TS. A weight function f(P_(i), s) may assign a weight W_(i) (of the set W1, . . . , WN) to the respective control policy P_(i) dependent on the current state denoted by s (e.g., W_(i)=f(P_(i), s)). A possible approach may be to calculate the weights W_(i) based on distances (e.g., according to a pre-defined metric of the state space) between the current state s and states stored together with P_(i) in a training set including states where P_(i) performed well. Uncertainty estimates (e.g., provided by a probabilistic policy) may also be included in the weight calculation.
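
A minimal sketch of such a distance-based weight function follows, assuming one stored reference state per policy and a Gaussian kernel as the pre-defined metric-based similarity; both simplifications are assumptions made for the example.

    import numpy as np

    def distance_based_weights(state, reference_states, length_scale=1.0):
        """Assign each policy P_(i) a weight from the distance between the
        current state s and a stored state where P_(i) performed well."""
        s = np.asarray(state, dtype=float)
        refs = np.asarray(reference_states, dtype=float)  # shape (N, state_dim)
        d2 = ((refs - s) ** 2).sum(axis=1)                # squared distances to s
        w = np.exp(-d2 / (2.0 * length_scale ** 2))       # closer state => larger weight
        return w / w.sum()                                # normalize to sum to one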

In one embodiment, the global and/or state dependent weighting is optimized using reinforcement learning. The action space of such a reinforcement learning problem is the space of the weights W1, . . . , WN, while the state space is the state space of the target system TS. For a pool of, for example, ten control policies, the action space is only ten-dimensional and, therefore, allows a rapid optimization with comparatively little input data and little computational effort. Meta actions may be used to reduce the dimensionality of the action space even further. Delayed effects are mitigated by using the reinforcement learning approach.
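
The sketch below frames this reinforcement learning problem in a conventional environment interface; the class name and the system interface with observe() and apply() are hypothetical stand-ins, and weighted-mean aggregation is assumed.

    import numpy as np

    class WeightAdjustmentEnv:
        """Sketch of the reinforcement learning problem described above: the
        agent's action is the weight vector (W1, . . . , WN), the state is the
        target system's state, and the reward is the measured performance."""

        def __init__(self, policies, system):
            self.policies = policies
            self.system = system

        def step(self, weights):
            state = self.system.observe()
            proposals = np.array([p(state) for p in self.policies])
            action = np.average(proposals, weights=weights)  # weighted mean aggregation
            reward = self.system.apply(action)               # measured performance
            return self.system.observe(), reward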

The adjustment of the weights W1, . . . , WN may be carried out by applying a measured performance of the ensemble of control policies P1, . . . , PN to a reward function. The reward function may be chosen according to the goal of maximizing efficiency, maximizing output, minimizing emissions, and/or minimizing wear of the target system TS. For example, a reward function used to train the control policies P1, . . . , PN may be used for training and/or initializing the weighting policy WP.
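
For illustration, such a composite reward might be sketched as below; the terms and coefficients are assumptions, and a real reward function would be chosen per the plant's optimization goal.

    def turbine_reward(power_output, emissions, wear_rate,
                       w_power=1.0, w_emissions=0.5, w_wear=0.2):
        """Illustrative composite reward: reward output while penalizing
        emissions and wear; larger values indicate better performance."""
        return w_power * power_output - w_emissions * emissions - w_wear * wear_rate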

With the trained weights W1, . . . , WN, the aggregated control action AGGA may be computed according to AGGA=AF(s, AP1, . . . , APN, W1, . . . , WN), with AP_(i)=P_(i)(s), i=1, . . . , N.

The elements and features recited in the appended claims may be combined in different ways to produce new claims that likewise fall within the scope of the present invention. Thus, whereas the dependent claims appended below depend from only a single independent or dependent claim, it is to be understood that these dependent claims can, alternatively, be made to depend in the alternative from any preceding or following claim, whether independent or dependent, and that such new combinations are to be understood as forming a part of the present specification.

While the present invention has been described above by reference to various embodiments, it should be understood that many changes and modifications can be made to the described embodiments. It is therefore intended that the foregoing description be regarded as illustrative rather than limiting, and that it be understood that all equivalents and/or combinations of embodiments are intended to be included in this description.

CLAIMS

1. A method for controlling a target system by a processor based on a pool of control policies, the method comprising: receiving the pool of control policies, the pool of control policies comprising a plurality of control policies; receiving weights for weighting each control policy of the plurality of control policies; weighting the plurality of control policies by the weights to provide a weighted aggregated control policy; controlling the target system using the weighted aggregated control policy; receiving performance data relating to a performance of the controlled target system; adjusting the weights by the processor based on the received performance data to improve the performance of the controlled target system; and reweighting the plurality of control policies by the adjusted weights to adjust the weighted aggregated control policy.

2. The method of claim 1, wherein adjusting the weights comprises training a neural network run by the processor.

3. The method of claim 2, further comprising: receiving operational data of at least one source system; and calculating the plurality of control policies from different data sets of the operational data.

4. The method of claim 3, wherein calculating the plurality of control policies comprises training the neural network or a further neural network.

5. The method of claim 3, wherein calculating the plurality of control policies comprises using a reward function relating to a performance of the at least one source system, and wherein adjusting the weights comprises using the reward function for the adjusting of the weights.

6. The method of claim 1, wherein the performance data comprises state data relating to a current state of the target system, and wherein the weighting of the plurality of control policies, the reweighting of the plurality of control policies, or the weighting of the plurality of control policies and the reweighting of the plurality of control policies depends on the state data.

7. The method as claimed in claim 1, wherein receiving the performance data comprises receiving the performance data from the controlled target system, from a simulation model of the target system, from a policy evaluation, or from any combination thereof.

8. The method of claim 1, wherein controlling the target system comprises determining an aggregated control action according to the weighted aggregated control policy by weighted majority voting, by forming a weighted mean, by forming a weighted median from action proposals according to the plurality of control policies, or by any combination thereof.

9. The method of claim 2, wherein the training of the neural network is based on a reinforcement learning model.

10. The method of claim 2, wherein the neural network operates as a recurrent neural network.

11. The method of claim 1, wherein the plurality of control policies is selected from the pool of control policies in dependence of a performance evaluation of control policies.

12. The method of claim 1, wherein control policies from the pool of control policies are included into or excluded from the plurality of control policies in dependence of the adjusted weights.

13. The method of claim 1, wherein the controlling, the receiving of the performance data, the adjusting, and the reweighting are run in a closed learning loop with the target system.

14. A controller for controlling a target system based on a pool of control policies, the controller comprising a processor and being configured to: receive the pool of control policies, the pool of control policies comprising a plurality of control policies; receive weights for weighting each control policy of the plurality of control policies; weight the plurality of control policies by the weights to provide a weighted aggregated control policy; control the target system using the weighted aggregated control policy; receive performance data relating to a performance of the controlled target system; adjust the weights by the processor based on the received performance data to improve the performance of the controlled target system; and reweight the plurality of control policies by the adjusted weights to adjust the weighted aggregated control policy.

15. In a non-transitory computer-readable storage medium that stores instructions executable by one or more processors to control a target system based on a pool of control policies, the instructions comprising: receiving the pool of control policies, the pool of control policies comprising a plurality of control policies; receiving weights for weighting each control policy of the plurality of control policies; weighting the plurality of control policies by the weights to provide a weighted aggregated control policy; controlling the target system using the weighted aggregated control policy; receiving performance data relating to a performance of the controlled target system; adjusting the weights by the one or more processors based on the received performance data to improve the performance of the controlled target system; and reweighting the plurality of control policies by the adjusted weights to adjust the weighted aggregated control policy.

16. The non-transitory computer-readable storage medium of claim 15, wherein adjusting the weights comprises training a neural network run by the one or more processors.

17. The non-transitory computer-readable storage medium of claim 16, wherein the instructions further comprise: receiving operational data of at least one source system; and calculating the plurality of control policies from different data sets of the operational data.

18. The non-transitory computer-readable storage medium of claim 17, wherein calculating the plurality of control policies comprises training the neural network or a further neural network.

19. The non-transitory computer-readable storage medium of claim 17, wherein calculating the plurality of control policies comprises using a reward function relating to a performance of the at least one source system, and wherein adjusting the weights comprises using the reward function for the adjusting of the weights.

20. The non-transitory computer-readable storage medium of claim 15, wherein the performance data comprises state data relating to a current state of the target system, and wherein the weighting of the plurality of control policies, the reweighting of the plurality of control policies, or the weighting of the plurality of control policies and the reweighting of the plurality of control policies depends on the state data.
 20. Thenon-transitory computer-readable storage medium of claim 15, wherein theperformance data comprises state data relating to a current state of thetarget system, and wherein the weighting of the plurality of controlpolicies, the reweighting of the plurality of control policies, or theweighting of the plurality of control policies and the reweighting ofthe plurality of control policies depends on the state data.